OC-IA-P7 Neural Network training

This notebook aims at training a neural network for sentiment analysis locally, before deploying it on Azure.

We'll compare:

Preprocess data

Extract data and get a shuffled balanced sample of 10 000 tweets
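A minimal sketch of this sampling step, assuming the raw tweets sit in a pandas DataFrame with a `sentiment` column (the frame, column name, and helper below are illustrative, not the notebook's exact code):

```python
import pandas as pd

def balanced_sample(df, target="sentiment", n=10_000, seed=42):
    """Draw an equal number of rows per class, then shuffle the result."""
    per_class = n // df[target].nunique()
    parts = [g.sample(per_class, random_state=seed) for _, g in df.groupby(target)]
    return (
        pd.concat(parts)
          .sample(frac=1, random_state=seed)   # shuffle
          .reset_index(drop=True)
    )

# Toy two-class frame to illustrate the behaviour
toy = pd.DataFrame({"text": [f"tweet {i}" for i in range(100)],
                    "sentiment": [0] * 50 + [4] * 50})
sample = balanced_sample(toy, n=20)
```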

Split dataset
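One way to do the split, sketched with scikit-learn's `train_test_split` (the toy corpus and the 80/20 ratio are illustrative; stratifying keeps the classes balanced in both splits):

```python
from sklearn.model_selection import train_test_split

texts = [f"tweet {i}" for i in range(100)]   # placeholder corpus
labels = [0] * 50 + [1] * 50

# Hold out 20% for testing; stratify so both splits keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
```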

Normalize data

Re-label sentiment feature (target)

Since, for our purposes, the positive class corresponds to negative/unhappy sentiment, we re-map the "sentiment" column to the expected values:
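Assuming the common 0 = negative tweet / 4 = positive tweet encoding, the re-mapping could look like this (the toy frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"sentiment": [0, 4, 4, 0]})  # toy frame with raw labels

# A negative tweet (0) becomes our positive class (1);
# a positive tweet (4) becomes the negative class (0).
df["sentiment"] = df["sentiment"].map({0: 1, 4: 0})
```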

Clean text

Text must be cleaned before embedding. We'll remove:

Then we'll apply stemming or lemmatization to enhance model performance. We'll compare both methods through the model results. Here is an example of each preprocessing method:

test_string = "@mimi2000 We, finally!: went to the shopping) 12centers! 34"
print('Test string:')
print(test_string)
print('\nPreprocessed string with lemmatization:')
print(DataPreprocessor(normalization='lem')._normalize_text(test_string))
print('\nPreprocessed string with stemming:')
print(DataPreprocessor(normalization='stem')._normalize_text(test_string))
print('\nPreprocessed string with no stemming/lemmatization:')
print(DataPreprocessor(normalization='keep')._normalize_text(test_string))
Test string:
@mimi2000 We, finally!: went to the shopping) 12centers! 34

Preprocessed string with lemmatization:
Loading vectors for word2vec model, please wait...
Vectors loaded.
we finally go shopping center

Preprocessed string with stemming:
Loading vectors for word2vec model, please wait...
Vectors loaded.
we final went shop center

Preprocessed string with no stemming/lemmatization:
Loading vectors for word2vec model, please wait...
Vectors loaded.
We finally went shopping centers

Embedding

For our first try, we'll use a pre-trained English Word2vec model from Gensim.

To embed whole sentences, we'll average the vectors of each word.
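A minimal sketch of this averaging, with a toy 3-d lookup table standing in for the real Gensim `KeyedVectors` (the words and values are illustrative):

```python
import numpy as np

# Toy embedding table; the notebook uses real pre-trained vectors instead.
vectors = {
    "we": np.array([0.2, 0.0, 0.4]),
    "go": np.array([0.0, 0.4, 0.0]),
}

def embed_sentence(tokens, vectors, dim=3):
    """Average the vectors of in-vocabulary words; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

sent_vec = embed_sentence(["we", "go", "shopping"], vectors)  # "shopping" is OOV here
```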

Our function is ready to preprocess each dataset:

Train models

Now that we have cleaned the data, we can create the model:

Baseline classifier: logistic regression
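As a sketch of such a baseline, with random features standing in for the averaged embeddings (the data, shapes, and labelling rule are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 300))          # stand-in for 300-d sentence embeddings
y_train = (X_train[:, 0] > 0).astype(int)      # synthetic, easily learnable labels

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
recall = recall_score(y_train, clf.predict(X_train))
```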

Simple neural network
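A minimal sketch of such a network in Keras, assuming 300-dimensional sentence embeddings as input (the layer sizes are illustrative, not necessarily the notebook's exact architecture):

```python
import tensorflow as tf

def build_model(input_dim=300):
    """Small dense classifier tracking recall, as in the notebook's experiments."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.Recall(name="recall")],
    )
    return model

model = build_model()
# model.fit(..., callbacks=[tf.keras.callbacks.TensorBoard(log_dir="logs")])
```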

Click here to go to TensorBoard.

The recall oscillates a lot. Maybe tuning the batch size will help? Let's train the model with different batch sizes, then, for each resulting series of val_recall values, compute its standard deviation:
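The comparison boils down to one standard deviation per batch size. A sketch, with toy numbers standing in for the `val_recall` series that `model.fit(...).history` would return:

```python
import numpy as np

# Toy per-epoch val_recall series keyed by batch size;
# in the notebook these come from the training histories.
histories = {
    32:  {"val_recall": [0.70, 0.80, 0.65, 0.78]},
    128: {"val_recall": [0.72, 0.74, 0.71, 0.75]},
}

# Lower standard deviation = less oscillation.
stability = {bs: np.std(h["val_recall"]) for bs, h in histories.items()}
```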

Batch size does not seem to have a significant effect on the val_recall oscillations. Maybe another activation function will help?

We notice that the model converges much faster with the SELU activation function, but it still oscillates a lot.

We used lemmatization and Word2vec embedding. Let's compare with other normalizing and embedding methods.

Find best preprocessing methods

Lemmatization with GloVe embedding seems to be the best combination, so we'll use it for the next steps. That said, there is not a huge difference between the combinations.

Tuning hyperparameters

Since the recall is rather unstable, we won't monitor it when tuning hyperparameters; we'll monitor val_loss instead.

Now that the tuner has found good parameters, we can use them in our model:

This is a little better than our baseline recall on a simple logistic regression (72%).